The Demand for a Sound Baseline in GPU Memory Architecture Research
Authors
Abstract
Modern GPUs adopt massive multithreading and multi-level cache hierarchies to hide long operation latencies, especially off-chip memory access latencies. However, poor cache indexing and cache-line allocation policies, as well as a small number of miss-status handling registers (MSHRs), can exacerbate cache thrashing and cache-miss-related resource congestion. In addition, modulo address mapping among memory partitions may cause severe partition camping, resulting in underutilization of DRAM bandwidth and of the banked L2 cache capacity. Furthermore, prior GPU cache bypassing studies unrealistically assume there is no limit on the number of in-flight bypassed requests, which may lead to pathological experimental results in simulation. In this work, we investigate the performance impact of the aforementioned factors and demonstrate the necessity of a sound baseline in GPU memory architecture research. Our results show that advanced cache indexing functions can greatly reduce conflict misses and improve cache efficiency, and that the allocation-on-fill policy yields better performance than allocation-on-miss. Moreover, performance does not consistently improve with more MSHRs; in certain scenarios it can even degrade. XOR mapping can greatly mitigate memory partition camping. Finally, the limited number of in-flight bypassed requests that hardware can support should be taken into account in GPU cache bypassing studies, for more reliable results and conclusions.
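The contrast between plain modulo indexing and a XOR-based indexing function can be sketched in a few lines. The parameters below (set count, line size, stride) are illustrative assumptions for a toy cache model, not the concrete indexing function evaluated in the paper: the point is only that a power-of-two access stride collapses onto a single set under modulo indexing, while a XOR hash that folds in higher address bits spreads the same accesses across many sets.

```python
# Hypothetical toy cache geometry, chosen only for illustration.
NUM_SETS = 64      # sets in the cache
LINE_BYTES = 128   # cache line size in bytes

def modulo_set_index(addr: int) -> int:
    """Conventional modulo indexing: the low bits of the line address
    select the set directly."""
    return (addr // LINE_BYTES) % NUM_SETS

def xor_set_index(addr: int) -> int:
    """XOR-based indexing: fold higher line-address bits into the set
    index so that power-of-two strides no longer alias to one set."""
    line = addr // LINE_BYTES
    return (line ^ (line // NUM_SETS)) % NUM_SETS

# A stride equal to one full "way" of the cache (NUM_SETS * LINE_BYTES)
# is the classic pathological pattern for modulo indexing.
stride = NUM_SETS * LINE_BYTES          # 8 KiB between accesses
addrs = [i * stride for i in range(16)]

mod_sets = {modulo_set_index(a) for a in addrs}
xor_sets = {xor_set_index(a) for a in addrs}

print(len(mod_sets))  # modulo: every access hits the same set
print(len(xor_sets))  # XOR: accesses spread across 16 distinct sets
```

The same hashing idea applies one level down: replacing the modulo mapping of line addresses to memory partitions with a XOR of higher address bits is what mitigates the partition camping the abstract describes.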
Similar Resources
Embedded Memory Test Strategies and Repair
The demand for self-testing increases proportionally with memory size in System on Chip (SoC) designs. Memories normally occupy the majority of a SoC's area, and as the density of embedded memories grows, a self-testing mechanism becomes a necessity in SoC design. This research study therefore focuses on this problem and introduces a streamlined solution for self-testing. In the proposed m...
An Approach to Improve the Particle Swarm Optimization Algorithm Using CUDA
The time consumed in solving computationally heavy problems has always been a concern for computer programmers. Due to the simplicity of its implementation, PSO (Particle Swarm Optimization) is a suitable meta-heuristic algorithm for such problems. However, despite this simplicity, the algorithm is inefficient for real computationally heavy problems, but the pr...
Ultra-Low-Energy DSP Processor Design for Many-Core Parallel Applications
Background and Objectives: Digital signal processors are widely used in energy-constrained applications in which battery lifetime is a critical concern; accordingly, designing ultra-low-energy processors is a major goal. In this work, as a first step, we propose a sub-threshold DSP processor. Methods: As our baseline architecture, we use a modified version of an existing ultra-low-power...
The Effectiveness of Training with a Reading Assistant Package on Dyslexic Children's Working Memory: A Multiple-Baseline Single-Case Study
Introduction: In the literature, cognitive problems such as poor working memory are considered among the main reading problems of dyslexic children. Therefore, the present study was conducted to determine the effect of Reading Assistant Package training on the working memory of dyslexic children. Methods: The present study used a single-subject, multiple-baseline research design...
متن کاملNVIDA CUDA Architecture-Based Parallel SAT Solver
The SAT problem was the first problem shown to be NP-complete, and no known algorithm solves it in polynomial time. Over the past decade, the development of efficient and scalable algorithms has dramatically improved the ability to solve SAT instances involving tens of thousands of variables and millions of constraints. But as industry demand increases, a faster SAT solver is ne...